AITopics | synthetic example

Collaborating Authors

synthetic example

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Embracing the chaos: analysis and diagnosis of numerical instability in variational flows

Neural Information Processing SystemsFeb-12-2026, 23:13:21 GMT

In this work, we treat variational flows as dynamical systems, and leverage shadowing theory to elucidate this behavior via theoretical guarantees on the error of sampling, density evaluation, and ELBO estimation.

artificial intelligence, dynamical system, machine learning, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > Tennessee > Knox County > Knoxville (0.04)
North America > Canada > British Columbia (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Africa > Mali (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Mathematics of Computing (0.67)

Add feedback

Privacy Preserving In-Context-Learning Framework for Large Language Models

Bhusal, Bishnu, Acharya, Manoj, Kaur, Ramneet, Samplawski, Colin, Roy, Anirban, Cobb, Adam D., Chadha, Rohit, Jha, Susmit

arXiv.org Artificial IntelligenceNov-20-2025

Large language models (LLMs) have significantly transformed natural language understanding and generation, but they raise privacy concerns due to potential exposure of sensitive information. Studies have highlighted the risk of information leakage, where adversaries can extract sensitive information embedded in the prompts. In this work, we introduce a novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees. Our approach leverages the Differential Privacy (DP) framework to ensure worst-case theoretical bounds on information leakage without requiring any fine-tuning of the underlying models. The proposed method performs inference on private records and aggregates the resulting per-token output distributions. This enables the generation of longer and coherent synthetic text while maintaining privacy guarantees. Additionally, we propose a simple blending operation that combines private and public inference to further enhance utility. Empirical evaluations demonstrate that our approach outperforms previous state-of-the-art methods on in-context-learning (ICL) tasks, making it a promising direction for privacy-preserving text generation while maintaining high utility. Our code is available at https://github.com/bhusalb/

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2509.13625

Country: North America > United States (1.00)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Embracing the chaos: analysis and diagnosis of numerical instability in variational flows

Neural Information Processing SystemsOct-8-2025, 19:59:21 GMT

artificial intelligence, dynamical system, machine learning, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > Tennessee > Knox County > Knoxville (0.04)
North America > Canada > British Columbia (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Africa > Mali (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Mathematics of Computing (0.67)

Add feedback

SMOTExT: SMOTE meets Large Language Models

Bystroński, Mateusz, Hołysz, Mikołaj, Piotrowski, Grzegorz, Chawla, Nitesh V., Kajdanowicz, Tomasz

arXiv.org Artificial IntelligenceMay-20-2025

Data scarcity and class imbalance are persistent challenges in training robust NLP models, especially in specialized domains or low-resource settings. We propose a novel technique, SMOTExT, that adapts the idea of Synthetic Minority Over-sampling (SMOTE) to textual data. Our method generates new synthetic examples by interpolating between BERT-based embeddings of two existing examples and then decoding the resulting latent point into text with xRAG architecture. By leveraging xRAG's cross-modal retrieval-generation framework, we can effectively turn interpolated vectors into coherent text. While this is preliminary work supported by qualitative outputs only, the method shows strong potential for knowledge distillation and data augmentation in few-shot settings. Notably, our approach also shows promise for privacy-preserving machine learning: in early experiments, training models solely on generated data achieved comparable performance to models trained on the original dataset. This suggests a viable path toward safe and effective learning under data protection constraints.

interpolation, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2505.13434

Country:

North America > United States > Minnesota (0.28)
Europe (0.28)

Genre: Research Report > Promising Solution (0.34)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.83)

Add feedback

Synthetic Text Generation for Training Large Language Models via Gradient Matching

Nguyen, Dang, Li, Zeman, Bateni, Mohammadhossein, Mirrokni, Vahab, Razaviyayn, Meisam, Mirzasoleiman, Baharan

arXiv.org Artificial IntelligenceFeb-24-2025

Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristics and cannot generate human-readable text without compromising the privacy of real data or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that guarantees the convergence and performance of LLMs during fine-tuning on a target task. To do so, we leverage Alternating Direction Method of Multipliers (ADMM) that iteratively optimizes the embeddings of synthetic examples to match the gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text can guarantee convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data. Experiments on various classification tasks confirm the effectiveness of our proposed approach.

arxiv preprint arxiv, real data, synthetic data, (14 more...)

arXiv.org Artificial Intelligence

2502.17607

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)

Genre: Research Report (0.64)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Is Long Context All You Need? Leveraging LLM's Extended Context for NL2SQL

Chung, Yeounoh, Kakkar, Gaurav T., Gan, Yu, Milne, Brenton, Ozcan, Fatma

arXiv.org Artificial IntelligenceFeb-13-2025

Large Language Models (LLMs) have demonstrated impressive capabilities across a range of natural language processing tasks. In particular, improvements in reasoning abilities and the expansion of context windows have opened new avenues for leveraging these powerful models. NL2SQL is challenging in that the natural language question is inherently ambiguous, while the SQL generation requires a precise understanding of complex data schema and semantics. One approach to this semantic ambiguous problem is to provide more and sufficient contextual information. In this work, we explore the performance and the latency trade-offs of the extended context window (a.k.a., long context) offered by Google's state-of-the-art LLM (\textit{gemini-1.5-pro}). We study the impact of various contextual information, including column example values, question and SQL query pairs, user-provided hints, SQL documentation, and schema. To the best of our knowledge, this is the first work to study how the extended context window and extra contextual information can help NL2SQL generation with respect to both accuracy and latency cost. We show that long context LLMs are robust and do not get lost in the extended contextual information. Additionally, our long-context NL2SQL pipeline based on Google's \textit{gemini-pro-1.5} achieve strong performances on various benchmark datasets without finetuning and expensive self-consistency based techniques.

information, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2501.12372

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(2 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Low-rank Bayesian matrix completion via geodesic Hamiltonian Monte Carlo on Stiefel manifolds

Cui, Tiangang, Gorodetsky, Alex

arXiv.org Machine LearningOct-26-2024

We present a new sampling-based approach for enabling efficient computation of low-rank Bayesian matrix completion and quantifying the associated uncertainty. Firstly, we design a new prior model based on the singular-value-decomposition (SVD) parametrization of low-rank matrices. Our prior is analogous to the seminal nuclear-norm regularization used in non-Bayesian setting and enforces orthogonality in the factor matrices by constraining them to Stiefel manifolds. Then, we design a geodesic Hamiltonian Monte Carlo (-within-Gibbs) algorithm for generating posterior samples of the SVD factor matrices. We demonstrate that our approach resolves the sampling difficulties encountered by standard Gibbs samplers for the common two-matrix factorization used in matrix completion. More importantly, the geodesic Hamiltonian sampler allows for sampling in cases with more general likelihoods than the typical Gaussian likelihood and Gaussian prior assumptions adopted in most of the existing Bayesian matrix completion literature. We demonstrate an applications of our approach to fit the categorical data of a mice protein dataset and the MovieLens recommendation problem. Numerical examples demonstrate superior sampling performance, including better mixing and faster convergence to a stationary distribution. Moreover, they demonstrate improved accuracy on the two real-world benchmark problems we considered.

data mining, machine learning, stiefel manifold, (14 more...)

arXiv.org Machine Learning

2410.20318

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.49)

Add feedback

MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task

Juraska, Juraj, Deutsch, Daniel, Finkelstein, Mara, Freitag, Markus

arXiv.org Artificial IntelligenceOct-4-2024

In this paper, we present the MetricX-24 submissions to the WMT24 Metrics Shared Task and provide details on the improvements we made over the previous version of MetricX. Our primary submission is a hybrid reference-based/-free metric, which can score a translation irrespective of whether it is given the source segment, the reference, or both. The metric is trained on previous WMT data in a two-stage fashion, first on the DA ratings only, then on a mixture of MQM and DA ratings. The training set in both stages is augmented with synthetic examples that we created to make the metric more robust to several common failure modes, such as fluent but unrelated translation, or undertranslation. We demonstrate the benefits of the individual modifications via an ablation study, and show a significant performance increase over MetricX-23 on the WMT23 MQM ratings, as well as our new synthetic challenge set.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2410.03983

Country:

Asia > Middle East > Iran (0.14)
Asia > Singapore (0.05)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.05)
(6 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.48)

Add feedback

Exploring Description-Augmented Dataless Intent Classification

Hu, Ruoyu, Khosmood, Foaad, Edalat, Abbas

arXiv.org Artificial IntelligenceJul-25-2024

In this work, we introduce several schemes to leverage description-augmented embedding similarity for dataless intent classification using current state-of-the-art (SOTA) text embedding models. We report results of our methods on four commonly used intent classification datasets and compare against previous works of a similar nature. Our work shows promising results for dataless classification scaling to a large number of unseen intents. We show competitive results and significant improvements (+6.12\% Avg.) over strong zero-shot baselines, all without training on labelled or task-specific data. Furthermore, we provide qualitative error analysis of the shortfalls of this methodology to help guide future research in this area.

classification, computational linguistic, dataset, (14 more...)

arXiv.org Artificial Intelligence

2407.17862

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New York (0.05)
North America > United States > California > Los Angeles County > Los Angeles (0.04)
(21 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Transportation > Passenger (1.00)
Transportation > Air (1.00)
Leisure & Entertainment (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)

Add feedback

Private prediction for large-scale synthetic text generation

Amin, Kareem, Bie, Alex, Kong, Weiwei, Kurakin, Alexey, Ponomareva, Natalia, Syed, Umar, Terzis, Andreas, Vassilvitskii, Sergei

arXiv.org Artificial IntelligenceJul-16-2024

We present an approach for generating differentially private synthetic text using large language models (LLMs), via private prediction. In the private prediction framework, we only require the output synthetic data to satisfy differential privacy guarantees. This is in contrast to approaches that train a generative model on potentially sensitive user-supplied source data and seek to ensure the model itself is safe to release. We prompt a pretrained LLM with source data, but ensure that next-token predictions are made with differential privacy guarantees. Previous work in this paradigm reported generating a small number of examples (<10) at reasonable privacy levels, an amount of data that is useful only for downstream in-context learning or prompting. In contrast, we make changes that allow us to generate thousands of high-quality synthetic data points, greatly expanding the set of potential applications. Our improvements come from an improved privacy analysis and a better private selection mechanism, which makes use of the equivalence between the softmax layer for sampling tokens in LLMs and the exponential mechanism. Furthermore, we introduce a novel use of public predictions via the sparse vector technique, in which we do not pay privacy costs for tokens that are predictable without sensitive data; we find this to be particularly effective for structured data.

dataset, privacy, synthetic data, (16 more...)

arXiv.org Artificial Intelligence

2407.12108

Country:

North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > Middle East > Jordan (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback